Summary

The degree of oxidation of carbon atoms in organic molecules depends on the covalent structure. In proteins, the average oxidation state of carbon (Zc) can be calculated as an elemental ratio from the chemical formula. To investigate oxidation-reduction (redox) patterns, groups of proteins from different subcellular locations and phylogenetic divisions were selected for comparison. Extracellular proteins of yeast have a relatively high oxidation state of carbon, corresponding with oxidizing conditions outside of the cell. However, an inverse relationship between Zc and redox potential occurs between the endoplasmic reticulum and cytoplasm; this trend is interpreted as resulting from overall coupling of protein turnover to the formation of a lower glutathione redox potential in the cytoplasm. In Rubisco homologues, lower Zc tends to occur in organisms with higher optimal growth temperature, and there are broad changes in Zc in whole-genome protein compositions in microbes from different environments. Energetic costs calculated from thermodynamic models suggest that thermophilic organisms exhibit molecular adaptation to not only high temperature but also the reducing nature of many hydrothermal fluids. A view of protein metabolism that depends on the chemical conditions of cells and environments raises new questions linking biochemical processes to changes on evolutionary timescales.

Keywords: oxidation state, redox potential, subcellular location, protein metabolism, protein evolution

1 Introduction

Chemical reactions involving the transfer of electrons, known as oxidation-reduction or redox reactions, are ubiquitous in cellular and environmental systems [1,2]. In the cell, the oxidation of thiol groups in proteins to form disulfides has the potential to regulate (activate or inhibit) enzymatic function [3]. Because these reactions are reversible on short timescales, a regulatory network known as redox signalling is made possible by reactions of small-molecule metabolites, including glutathione (GSH) and reactive oxygen species [4]. On timescales of metabolism, complex oxidation-reduction reactions are required for the formation (anabolism) and degradation (catabolism) of proteins and other biomolecules. Although many individual steps in biomass synthesis are irreversible, much biomass is ultimately recycled through endogenous metabolism [5]. On longer timescales, forces outside of individual cells and organisms sustain the redox disequilibria between inorganic and/or organic species that provide the energy source for metabolisms suited to a multitude of environments [6]. In turn, the actions of organisms can alter the redox conditions on Earth; the oxygenation of the atmosphere and oceans over geological time has a biogenic origin, and changed the course of later biological evolution [7].

Through evolution, the sequences of genes, and their protein products, are progressively altered. The elemental stoichiometry (chemical formula) and standard Gibbs energy of the molecules have a primary impact on metabolic requirements for energy and elemental resources. The energetic cost for synthesis of biomass is a function not only of the composition of the biomass, but also of environmental parameters including temperature and the concentrations of metabolic precursors. Temperature and oxidation-reduction potential have profound effects on the relative energetic costs of formation of different amino acids [8] or proteins [9]. These energetic costs are sensitive to differences in the elemental compositions of biomolecules. To a first approximation, a shift to a more reducing environment alters the energetics of reactions in a direction that favours the formation of relatively reduced chemical compounds. In a field test of this principle, metagenomic sequences for the most highly reduced proteins were found in the hottest and most reducing zones of a hot spring [10, 11].

The purpose of this study is to investigate a particular stoichiometric quantity, the average oxidation state of carbon (Zc, defined below), as a comparative tool for identifying compositional patterns at different levels of biological organization. By comparing a quantity derived from the elemental compositions of proteins, this study addresses one aspect of biochemical evolution. However, the questions raised here differ in important respects from conventionally defined biochemistry and evolutionary biology. Biochemical studies are most often concerned with the functions of molecules [12], including enzymatic catalysis and non-covalent interactions involved in the structural conformation of proteins and binding of ligands. Studies in molecular evolution often place emphasis on the historical relationships between sequences, but not their physical properties [12]. Combining these viewpoints, most current work assumes that structural stability of proteins is the primary criterion for molecular adaptation to high temperature [13]. In contrast, in this study, more attention is given to the stoichiometric and energetic demands of the reactions leading to protein formation. Because material replacement of proteins depends on metabolic outputs [14], the compositional differences among proteins have significant consequences for cellular organization and metabolism.

The following questions have been identified: 1) How does the relationship between Zc of amino acids and corresponding codons relate to the origin or form of the genetic code? 2) How do the differences in Zc between membrane proteins and others compare with properties of amino acids, e.g. hydrophobicity, known to favour localization to membranes? 3) How are the differences in Zc of proteins among eukaryotic subcellular compartments related to differences in redox potential? 4) How are the differences in Zc in families of redox-active proteins related to the standard reduction potentials of the proteins? 5) How are the differences in Zc among different organisms, both in terms of bulk (genome-derived) protein composition and for homologues of a single family (Rubisco), related to environmental conditions, especially temperature and redox potential?

In the Results each problem is briefly introduced, the empirical distribution of Zc is described, and a discussion is developed to explore how the patterns reflect biochemical and evolutionary constraints. These short discussions, corresponding to topics (1)-(4), should be regarded as preliminary, and probably incomplete, interpretations. The discussions are limited because there is no complete conceptual framework that links the biochemical reactions with the evolutionary processes that are implicit in all of the comparisons. The final section of the Results goes into more detail for the phylogenetic comparisons (topic 5) by examining the relative Gibbs energies of formation of proteins in environments of differing redox potential.



2 Methods 



Throughout this study, "reducing" and "oxidizing" are used in reference to oxidation-reduction potential, tied to a particular redox couple or to environmental conditions, often expressed as millivolts on the Eh scale. "Reduced" and "oxidized" are used to refer to variations in the oxidation state of carbon.

The formal oxidation state of a carbon atom in an organic molecule can be calculated by counting -1 for each carbon-hydrogen bond, 0 for each carbon-carbon bond, and +1 for each bond of carbon to O, N or S [8, 15]. In photosynthesizing organisms, and autotrophs in general, the carbon source is CO2, which has the highest oxidation state of carbon (+4). The products of photosynthetic reactions include proteins and other biomolecules with a lower oxidation state of carbon. Even if the molecular structure is unknown, analytical elemental compositions can be used in calculations of the average oxidation state of carbon in biomass [16, 17]. Because any gene or protein sequence corresponds to a definite canonical (nonionized, unphosphorylated) chemical formula, the average oxidation state of carbon in these biomolecules is easily calculated.
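The bond-counting rule can be illustrated with a short sketch. The Python below is not part of the study's own calculations (which used R with CHNOSZ); the function name and the encoding of a carbon atom as a list of bonded partner elements are assumptions made for illustration. It counts -1 for each bond to H, 0 for each bond to C and +1 for each bond to O, N or S, which reproduces the +4 value for CO2.

```python
# Contribution of each bond to the formal oxidation state of a carbon atom.
BOND_CONTRIBUTION = {"H": -1, "C": 0, "O": +1, "N": +1, "S": +1}


def carbon_oxidation_state(bonded_atoms):
    """Sum the bond contributions for one carbon atom.

    `bonded_atoms` names the partner element once per bond, so a double
    bond to oxygen is written as two "O" entries.
    """
    return sum(BOND_CONTRIBUTION[atom] for atom in bonded_atoms)


# CO2: two C=O double bonds, i.e. four bonds to oxygen.
co2 = carbon_oxidation_state(["O", "O", "O", "O"])

# Methane: four C-H bonds.
methane = carbon_oxidation_state(["H", "H", "H", "H"])

# Acetic acid, CH3-COOH: one methyl carbon and one carboxyl carbon.
methyl = carbon_oxidation_state(["H", "H", "H", "C"])
carboxyl = carbon_oxidation_state(["C", "O", "O", "O"])
acetic_acid_average = (methyl + carboxyl) / 2

print(co2, methane, methyl, carboxyl, acetic_acid_average)
```

The acetic acid case shows why an average is useful: the two carbon atoms have very different formal oxidation states, but a single mean value characterizes the molecule as a whole.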

In amino acids and proteins, the average oxidation state of carbon (Zc) can be calculated using

$$ Z_\mathrm{C} = \frac{-n_\mathrm{H} + 3n_\mathrm{N} + 2n_\mathrm{O} + 2n_\mathrm{S} + Z}{n_\mathrm{C}}, \tag{1} $$

where Z is the charge on the molecule, and n_C, n_H, n_N, n_O and n_S are the numbers of the subscripted elements in the chemical formula of the molecule. The coefficients on the terms in the numerator derive from formal charges of atoms other than C, as follows: H (+1), N (-3), O (-2), S (-2). Negative formal charges reflect greater electronegativities of these elements compared to carbon. If two thiol groups react to form a disulfide bond, the oxidation states of the two affected sulfur atoms change from -2 to -1. Although H2 is produced in this reaction, the oxidation state of carbon in the protein remains constant. It follows that equation (1) is applicable only to chemical formulas of proteins in which the N, O, and S are all fully reduced (bonded only to H and/or C).

The Z in equation (1) ensures that ionization by gain or loss of a proton, having an equal effect on Z and n_H, does not change the Zc. Likewise, gain or loss of H2O, which affects equally the values of n_H and 2n_O, does not alter the average oxidation state of carbon [15]. Accordingly, the Zc of a peptide formed by polymerization of amino acids (a dehydration reaction) is a weighted average of the Zc in the amino acids, where the weights are the number of carbon atoms in each amino acid. As an example, the Zc of hen egg white lysozyme, having a chemical formula of C613H959N193O185S10, is 0.016. This protein is oxidized compared to many other proteins, which commonly have negative values of Zc.
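As a worked check of equation (1), the following sketch evaluates Zc directly from element counts. It is written in Python for illustration only (the study itself used R with CHNOSZ), and the function name and dictionary layout are assumptions. The lysozyme formula is the one quoted above; glycine and alanine are used to illustrate the invariance to protonation and to dehydration.

```python
from fractions import Fraction


def average_carbon_oxidation_state(formula, charge=0):
    """Equation (1): Zc = (-nH + 3 nN + 2 nO + 2 nS + Z) / nC.

    `formula` maps element symbols to integer counts; missing elements
    count as zero. Exact fractions avoid rounding in the comparisons.
    """
    def n(element):
        return formula.get(element, 0)

    numerator = -n("H") + 3 * n("N") + 2 * n("O") + 2 * n("S") + charge
    return Fraction(numerator, n("C"))


# Hen egg white lysozyme, C613 H959 N193 O185 S10.
lysozyme = {"C": 613, "H": 959, "N": 193, "O": 185, "S": 10}
zc_lysozyme = average_carbon_oxidation_state(lysozyme)

# Protonation changes nH and Z together, leaving Zc unchanged.
glycine = {"C": 2, "H": 5, "N": 1, "O": 2}
glycine_cation = {"C": 2, "H": 6, "N": 1, "O": 2}
zc_glycine = average_carbon_oxidation_state(glycine)
zc_glycine_cation = average_carbon_oxidation_state(glycine_cation, charge=+1)

# Dehydration: Gly-Ala dipeptide = glycine + alanine - H2O.
alanine = {"C": 3, "H": 7, "N": 1, "O": 2}
zc_alanine = average_carbon_oxidation_state(alanine)
dipeptide = {el: glycine[el] + alanine[el] for el in "CHNO"}
dipeptide["H"] -= 2
dipeptide["O"] -= 1
zc_dipeptide = average_carbon_oxidation_state(dipeptide)

# Carbon-weighted average of the amino acid values (2 C in Gly, 3 C in Ala).
zc_weighted = (2 * zc_glycine + 3 * zc_alanine) / 5

print(round(float(zc_lysozyme), 3), zc_dipeptide, zc_weighted)
```

For lysozyme the numerator is -959 + 579 + 370 + 20 = 10, so Zc = 10/613, which rounds to the 0.016 given in the text.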

To aid in reproducibility, data files of protein sequences or amino acid composition, except large files available from public databases, and computer program files for the calculations are provided in the Supporting Information. The calculations and figures were generated using the R software environment [18] together with the CHNOSZ package [19].



Figure 1: Average oxidation state of carbon (Zc) in amino acids compared with (a) Zc in first two 
bases of the corresponding RNA codons and (b) hydropathy index of the amino acids taken from Kyte 
and Doolittle, 1982 [24]. Standard one-letter abbreviations for the amino acids are used to identify the 
points. In (a), the different codon compositions for serine (S) and arginine (R) are indicated by letters 
below the symbols, and some amino acid labels are shifted for readability. In (b), labels for asparagine 
(N) and glutamine (Q) are omitted for clarity; they plot at the same positions as aspartic acid (D) and 
glutamic acid (E), respectively. 

3 Results

3.1 Comparison of Zc of amino acids with hydropathy and properties of codons

In contemplating the ancient origin of the genetic code, the chemical similarities of respective codons and amino acids have been used to argue for coevolution (shared biosynthetic pathways) [20] or a tendency toward similar physicochemical properties. The possible advantages that were identified for similar physicochemical properties include enhancing the steric interactions between amino acids and codons [21], or increasing the similarity between different amino acids resulting from a single DNA base mutation in order to maintain protein structure [20].

In the genetic code, the first two bases (a "doublet") are more indicative of the amino acid than the third position of the codon [21]. The Zc of amino acids are compared with the values calculated for the corresponding RNA nucleobase doublets in figure 1a. Some of the doublets, e.g. UU (phenylalanine, leucine), CU (leucine), UC (serine) and CC (proline) have identical Zc (in this case, 1.5), leading to only 5 possible values of Zc for the doublets. The overall relationship suggested by figure 1a is a loose correlation between the Zc of amino acids and of the RNA doublets. The most highly reduced amino acids, leucine (L) and isoleucine (I), are coded for by doublets having the two lowest Zc values.
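The doublet values can be reproduced with a small enumeration. The sketch below is an illustration in Python, not the study's own code; it assumes that the Zc of a doublet is obtained by summing the chemical formulas of the two free nucleobases (adenine C5H5N5, cytosine C4H5N3O, guanine C5H5N5O, uracil C4H4N2O2) and applying equation (1) to the total. Under that assumption it gives 1.5 for UU, CU, UC and CC and five distinct values overall, in agreement with the text.

```python
from fractions import Fraction
from itertools import product

# Element counts of the free RNA nucleobases (assumed basis for the doublets).
NUCLEOBASES = {
    "A": {"C": 5, "H": 5, "N": 5, "O": 0},  # adenine
    "C": {"C": 4, "H": 5, "N": 3, "O": 1},  # cytosine
    "G": {"C": 5, "H": 5, "N": 5, "O": 1},  # guanine
    "U": {"C": 4, "H": 4, "N": 2, "O": 2},  # uracil
}


def zc(formula):
    """Equation (1) for an uncharged molecule without sulfur."""
    numerator = -formula["H"] + 3 * formula["N"] + 2 * formula["O"]
    return Fraction(numerator, formula["C"])


def doublet_zc(doublet):
    """Zc of the summed formulas of the two bases in a doublet such as 'UC'."""
    total = {el: sum(NUCLEOBASES[base][el] for base in doublet) for el in "CHNO"}
    return zc(total)


# All 16 possible first-two-base combinations.
doublets = ["".join(pair) for pair in product("ACGU", repeat=2)]
doublet_values = {d: doublet_zc(d) for d in doublets}
distinct_values = sorted(set(doublet_values.values()))

for value in distinct_values:
    members = [d for d in doublets if doublet_values[d] == value]
    print(float(value), members)
```

Because uracil and cytosine have the same Zc and the same number of carbon atoms, any doublet built only from U and C has the same value, which is why the pyrimidine-only doublets coincide.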

The increase in Zc going from leucine to alanine to glycine (figure 1) is reflected in the metastability fields of these amino acids, which occur in order of increasing oxidation potential, or oxygen fugacity [22]. Metastable equilibrium refers to the equalization of the energies of reactions to form the amino acids; it is a partial equilibrium because the amino acids generally remain unstable with respect to inorganic species. Likewise, the relative Gibbs energies of formation reactions of amino acids differ considerably between hydrothermal (hot, reducing) and surface (cool, oxidizing) environments [8]. These patterns support the possibility of metastable equilibria in hydrothermal environments between amino acids and nucleobase sequences that are paired in the genetic code. Tests of the potential for these states, carried out using Gibbs energies available at high temperature [9,23], could reveal thermodynamic constraints on the energetics of abiotic or early biosynthesis independent of the arguments based on similarities in protein structure and biosynthetic pathways [20,21].

Figure 2: Average oxidation state of carbon shown in histograms and normal probability plots for (a-b)
all human proteins and (c-d) human membrane proteins. Only proteins of sequence length greater than
or equal to 50 amino acids are considered. In the normal probability plots the lines are drawn through
the 1st and 3rd quartiles, indicated by the crosses.

The hydropathy index, based on the relative hydrophobicity and hydrophilicity of amino acids [24], is commonly used for identifying probable membrane-spanning domains of proteins. In figure 1b, Zc is compared with the hydropathy values for individual amino acids. The three amino acids with the highest hydropathy index, isoleucine, leucine and valine, are also the three with the lowest Zc. Therefore, membrane proteins with hydrophobic domains are likely to be more reduced than other proteins. The following sections examine the actual differences in human and yeast proteins.



3.2 Differences in Zc of membrane proteins

The lipid (fatty acid) components of membranes are reduced relative to many other biomolecules, including amino acids, nucleotides, and saccharides (see figure 1 of Amend et al., 2013 [25]). Proteins that are embedded in membranes tend to contain more hydrophobic amino acids, which enhance solubility of proteins in the membrane environment [24] and generally are relatively reduced (figure 1b).

To compare membrane proteins with other human proteins, sequences for all human proteins were taken from the UniProt database [26], and sequences for predicted membrane proteins were taken from all FASTA sequence files provided in Additional File 2 of Almen et al., 2009 [27]. Only sequences at least 50 amino acids in length were considered. The distribution of Zc of all human proteins (figure 2a; n = 83994) is centred on -0.123 (median) and -0.120 (mean). For human membrane proteins, the distribution of Zc is shifted to lower values (figure 2c; n = 6627; median -0.186, mean -0.189). The mean value is lower for membrane proteins than for all human proteins (Student's t-test: p < 2.2 × 10^-16). Thus, the proteins located in the membranes are, on average, more reduced than other proteins in humans. It is tempting to speculate that the coexistence of reduced proteins with other relatively reduced biomolecules (lipids) reflects a compositional similarity that would contribute to energy optimization if metabolic pathways for proteins and lipids were operating under common redox potential conditions.

The observed distributions of Zc are each compared with a normal distribution in normal probability plots (theoretical quantile-quantile (Q-Q) plots) in figure 2b,d. The steeper trends in the low- and high-quantile range of figure 2b indicate that the distribution of Zc of human proteins has relatively long tails, especially at high Zc, compared to a normal distribution. Although an asymmetry is apparent in the uneven shape of the histogram in figure 2c and the wiggles in figure 2d, the overall distribution of Zc of the membrane proteins more closely resembles a normal distribution. Comparisons with normal distributions have implications, through the central limit theorem [28], for assessing the impact of many small-scale, independent effects in evolution on the chemical composition of organisms or their components. This theme, however, is not developed further here; instead, the overall differences in Zc of proteins in subcellular compartments are considered next.
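The construction of a normal probability plot with a reference line through the 1st and 3rd quartiles can be sketched as follows. This Python version is only an illustration of the technique; the plotting positions (i + 0.5)/n and the default quartile method of the standard library are assumptions and need not match the settings used for figure 2. Because the per-protein Zc values are not reproduced in the text, the sketch is applied, purely as a stand-in, to the 17 median Zc values of table 1.

```python
from statistics import NormalDist, quantiles

# Stand-in data: median Zc values from table 1 (not the data of figure 2).
TABLE1_MEDIAN_ZC = [
    -0.136, -0.129, -0.164, -0.191, -0.137, -0.160, -0.111, -0.164, -0.096,
    -0.113, -0.209, -0.200, -0.184, -0.205, -0.203, -0.161, -0.176,
]


def normal_probability_points(values):
    """Pair each sorted observation with a standard normal quantile."""
    ordered = sorted(values)
    n = len(ordered)
    standard_normal = NormalDist()
    theoretical = [standard_normal.inv_cdf((i + 0.5) / n) for i in range(n)]
    return list(zip(theoretical, ordered))


def quartile_line(values):
    """Line through the (theoretical, observed) 1st and 3rd quartile points."""
    q1, _median, q3 = quantiles(values, n=4)
    standard_normal = NormalDist()
    z1 = standard_normal.inv_cdf(0.25)
    z3 = standard_normal.inv_cdf(0.75)
    slope = (q3 - q1) / (z3 - z1)
    intercept = q1 - slope * z1
    return slope, intercept, (z1, q1), (z3, q3)


points = normal_probability_points(TABLE1_MEDIAN_ZC)
slope, intercept, first_quartile, third_quartile = quartile_line(TABLE1_MEDIAN_ZC)

# Deviations from the line; large values at the ends indicate long tails.
residuals = [(z, observed - (slope * z + intercept)) for z, observed in points]
for z, deviation in residuals:
    print(round(z, 2), round(deviation, 4))
```

In a plot of this kind, points that rise above the line at high quantiles or fall below it at low quantiles correspond to the longer-than-normal tails described for figure 2b.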

3.3 Subcellular differences in Zc of proteins and comparison with subcellular redox potential

For some model organisms, including Saccharomyces cerevisiae (yeast), the identities of proteins associated with subcellular compartments are now available in databases. Here, calculations of Zc of proteins and a comparison with independent measurements of redox potential are used to investigate the oxidation-reduction features and dynamics of cellular structure.

In a previous study, the limiting conditions for chemical transformations among proteins in subcellular compartments were quantified theoretically as a function of redox potential and hydration state [29]. In that study, the locations of proteins were taken from the "YeastGFP" study of Huh et al., 2003 [30]. That dataset has the advantage that relative abundances of many of the proteins are available, but it is limited to 23 named locations in the cell. In order to consider more cellular components (including the membranes), a more extensive reference proteome is used in this study. This proteome is based on the current Saccharomyces Genome Database (SGD) [31] annotations combined with the Gene Ontology (GO) [32] vocabulary for the "cellular component" aspect, which describes many organelles and membranes within the cell. Major cell components were selected for comparison, and the Zc was calculated for protein products of the genes, as summarized in table 1. The median values are also portrayed in the drawing of a yeast cell in figure 3.

It is apparent that the membrane proteins are highly reduced. However, not all membranes are equal; the proteins in the nuclear and inner and outer mitochondrial membranes are less reduced than those in the plasma membrane, and the endoplasmic reticulum (ER) membrane has very highly reduced proteins. Among the organellar proteins considered, the ER has the most reduced proteins, followed by vacuoles and mitochondria; the cytoplasmic proteins are moderately reduced. The proteins in the nucleus, bud neck and bud are more oxidized than in the other compartments. The most oxidized proteins in the system are the extracellular ones. The relative Zc of the proteins in the ER, mitochondrion, cytoplasm, and nucleus are consistent with the previous study in which the calculations took account of the abundances of proteins [29].

For comparison with Zc of proteins, the values of subcellular redox potential (Eh, in mV) listed in table 2 were compiled from literature sources [1, 33-40]. Measurements of reduced and oxidized glutathione (GSH and GSSG) in whole-cell extracts have been interpreted as reflecting cytoplasmic redox potential, but redox-sensitive green fluorescent protein (roGFP) probes [33] provide more specific data for subcellular locations. The data are not in all cases acquired from yeast, but it has been noted that cytoplasmic Eh values based on roGFP are similar for different model organisms [36].

Table 1: Summary of Zc of proteins in subcellular locations of yeast. Numbers of proteins (n) in SGD
associated with the indicated GO terms are listed. The numerators of the fractions denote membrane-
associated proteins that are also listed as "integral to membrane" (GO:0016021); only these proteins
were used in the calculations of Zc.

cellular component               GO term      n         median Zc   mean Zc
cytoplasm                        GO:0005737   2245      -0.136      -0.127
nucleus                          GO:0005634   2073      -0.129      -0.121
mitochondrion                    GO:0005739   1077      -0.164      -0.159
endoplasmic reticulum            GO:0005783   435       -0.191      -0.192
nucleolus                        GO:0005730   263       -0.137      -0.128
Golgi apparatus                  GO:0005794   215       -0.160      -0.167
cellular bud neck                GO:0005935   153       -0.111      -0.108
vacuole                          GO:0005773   175       -0.164      -0.163
extracellular region             GO:0005576   95        -0.096      -0.098
cellular bud tip                 GO:0005934   96        -0.113      -0.110
endoplasmic reticulum membrane   GO:0005789   283/338   -0.209      -0.211
plasma membrane                  GO:0005886   224/427   -0.200      -0.188
mitochondrial inner membrane     GO:0005743   143/218   -0.184      -0.184
vacuolar membrane                GO:0005774   100/145   -0.205      -0.196
Golgi membrane                   GO:0000139   76/121    -0.203      -0.200
nuclear membrane                 GO:0031965   54/67     -0.161      -0.139
mitochondrial outer membrane     GO:0005741   51/92     -0.176      -0.177

The redox potentials in the vacuole and extracellular space are less well constrained than other locations. Under stress response, high amounts of GSSG, but not GSH, are sequestered in vacuoles [41]. A conservative lower range for the Eh of vacuoles (-160 to -130 mV) was calculated by taking a value of 80% GSSG and computing Eh from the GSH-GSSG equilibrium at concentrations of 1-10 mM GSH (see equation 21 and figure 4 of Ref. [35]). The redox potential would be higher if the GSSG/GSH ratio were in fact greater than 80/20. A high redox potential is also implicated by the presence of ferric iron species in vacuoles [42]. Extracellular redox state can vary greatly, but in aerobic organisms and laboratory culture it is likely to be generally oxidizing compared to subcellular compartments.
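The dependence of Eh on the glutathione pool can be sketched with the Nernst equation for the GSSG/2GSH couple. Everything numerical in the Python below is an assumption made for illustration: a midpoint potential of -240 mV (a commonly cited value for pH 7 and 25 °C), a slope of 59.1/2 mV per log10 unit for a two-electron couple, and an interpretation of "80% GSSG" as the fraction of total glutathione equivalents present in the oxidized form. Equation 21 of Ref. [35] is not reproduced here, so the output is not expected to match the -160 to -130 mV range exactly; the sketch shows only the direction of the dependence.

```python
import math

# Assumed constants (not taken from the text): pH 7, 25 degrees C.
E0_GSH_MV = -240.0            # midpoint potential of the GSSG/2GSH couple
NERNST_SLOPE_MV = 59.1 / 2    # mV per log10 unit for a two-electron couple


def gsh_redox_potential_mv(gsh_molar, gssg_molar):
    """Nernst equation for GSSG + 2 H+ + 2 e- = 2 GSH, in mV."""
    return E0_GSH_MV - NERNST_SLOPE_MV * math.log10(gsh_molar ** 2 / gssg_molar)


def potential_from_oxidized_fraction(total_gsh_equivalents_molar, oxidized_fraction):
    """Eh for a pool in which `oxidized_fraction` of the GSH equivalents is GSSG.

    Each GSSG molecule accounts for two GSH equivalents.
    """
    gsh = total_gsh_equivalents_molar * (1 - oxidized_fraction)
    gssg = total_gsh_equivalents_molar * oxidized_fraction / 2
    return gsh_redox_potential_mv(gsh, gssg)


eh_10mm_80 = potential_from_oxidized_fraction(10e-3, 0.80)
eh_1mm_80 = potential_from_oxidized_fraction(1e-3, 0.80)
eh_10mm_90 = potential_from_oxidized_fraction(10e-3, 0.90)

print(round(eh_10mm_80, 1), round(eh_1mm_80, 1), round(eh_10mm_90, 1))
```

Because the GSH concentration enters as a square, Eh depends on the absolute concentration as well as on the GSSG/GSH ratio: a more dilute or more oxidized pool gives a higher potential, consistent with treating the 80/20 case as a lower bound.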

The values in table 2 are not comprehensive, and should be taken as a rough guide, but even with the uncertainties, comparison with the interquartile range of Zc of the proteins reveals some trends (figure 4a). The difference in both Zc and Eh is positive going from any subcellular compartment, except for vacuoles, to the extracellular space. This pattern has an intuitive explanation: by evolutionary adjustment to optimize proteins for their environment, the inside of the cell, which is more reducing, would be expected to have more reduced proteins compared to the outside.

The more surprising trend in figure 4a is an inverse relationship between Zc and Eh of the cytoplasm and ER. Does this contrast have any biochemical significance? The ER is a component of the secretory pathway, which transports proteins to membranes and to outside the cell [38]. Let us conjecture that the populations of proteins in the ER and cytoplasm are connected through common metabolic intermediates; that is, their formation and degradation are part of the recycling of biomass through endogenous metabolism [5], also implied by metabolic closure [14]. It follows that the formation of proteins of a higher Zc in the cytoplasm entails the loss of electrons from proteins into metabolic pathways. Perhaps these pathways ultimately transfer these electrons to the formation of GSH in the cytoplasm.
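The direction of these contrasts can be tabulated from the values quoted in tables 1 and 2. The Python sketch below is illustrative only: it uses the median Zc rather than the interquartile ranges drawn in figure 4a, treats the open-ended upper limit for the extracellular Eh as a fixed bound, and reports a sign for the Eh difference only when the two ranges do not overlap. These simplifications and all names are assumptions.

```python
# Median Zc (table 1) and Eh ranges in mV (table 2) for three locations.
MEDIAN_ZC = {
    "cytoplasm": -0.136,
    "endoplasmic reticulum": -0.191,
    "extracellular": -0.096,
}
EH_RANGE_MV = {
    "cytoplasm": (-320, -240),
    "endoplasmic reticulum": (-208, -133),
    "extracellular": (-150, 160),  # upper limit is open-ended in table 2
}


def sign(value):
    return (value > 0) - (value < 0)


def compare(source, target):
    """Signs of the change in Zc and in Eh going from source to target.

    The Eh sign is 0 when the two ranges overlap (direction undetermined).
    """
    delta_zc = sign(MEDIAN_ZC[target] - MEDIAN_ZC[source])
    source_low, source_high = EH_RANGE_MV[source]
    target_low, target_high = EH_RANGE_MV[target]
    if target_low > source_high:
        delta_eh = 1
    elif target_high < source_low:
        delta_eh = -1
    else:
        delta_eh = 0
    return delta_zc, delta_eh


cytoplasm_to_outside = compare("cytoplasm", "extracellular")
cytoplasm_to_er = compare("cytoplasm", "endoplasmic reticulum")

# Increase in Zc from ER-like to cytoplasm-like protein composition,
# equivalent to the number of electrons removed per carbon atom.
electrons_released_per_carbon = (
    MEDIAN_ZC["cytoplasm"] - MEDIAN_ZC["endoplasmic reticulum"]
)

print(cytoplasm_to_outside, cytoplasm_to_er, round(electrons_released_per_carbon, 3))
```

Matching signs for the cytoplasm and the extracellular space correspond to the intuitive pattern, while opposite signs for the cytoplasm and ER correspond to the inverse relationship that motivates the conjecture.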




Figure 3: A schematic drawing of a yeast cell showing median values of the average oxidation state of 
carbon in proteins from selected subcellular locations listed in table 1. All cellular components listed
in the table are represented in the drawing. The colour scale is adjusted so that the cytoplasm has a 
neutral hue (white), and locations with relatively oxidized and reduced proteins are depicted by blue 
and red colours, respectively. Darker red colours are used for the more reduced groups of proteins in 
some of the membranes such as the ER, Golgi, vacuolar and plasma membranes. The nuclear and inner 
and outer mitochondrial membranes are shown with lighter reds because their proteins are relatively 
oxidized compared to those in the other membranes. Components not separated by membranes (or with 
membranes not shown) include the nucleolus, bud neck and bud tip. 



Table 2: Values of Eh compiled from literature sources, for yeast cells or culture except as noted. The
ranges account for variation among different cell types, experimental techniques and published values,
and are used to construct figure 4.

location                range (mV)      references
mitochondrion           -360 to -255    Ref. [33]: roGFP probe (-360 mV; human HeLa).
                                        Ref. [34]: rxYFP in matrix (-296 mV) and intermembrane space (-255 mV).
cytoplasm               -320 to -240    Ref. [35]: GSH in proliferating mammalian cells (-240 mV).
                                        Ref. [36]: roGFP probe of GSH (-320 mV).
endoplasmic reticulum   -208 to -133    Ref. [37]: NYTC peptide (-185 to -133 mV; murine hybridoma CRL-1606).
                                        Ref. [38]: roGFP (-208 mV; human HeLa).
vacuole                 -160 to >-130   See text.
extracellular           -150 to >160    Ref. [1]: Aerobic (160 mV) and anaerobic (90 mV) cultures.
                                        Ref. [39]: Very high-gravity fermentation (-150 mV).
                                        Ref. [40]: H. sapiens plasma (-140 mV).

Figure 4: The plot (a) compares average oxidation state of carbon in proteins from different subcellular
locations of yeast with Eh values taken from glutathione (GSH-GSSG) or other redox indicators (see
table 2). The heights of the boxes indicate the interquartile ranges of Zc values, and the widths represent
the ranges of Eh listed in table 2. The scheme (b) invokes electron transfer to account for the contrasts
in redox potential of GSH/GSSG and Zc of proteins between ER and cytoplasm (see text).



This scheme is depicted in figure 4b. The vertical dashed line represents a physical (but not impermeable) boundary between ER and cytoplasm; the curved lines represent reactions and transport within glutathione and protein systems that cross the compartments, and the arrow represents a linkage between glutathione and protein systems, which is effectively a coupled oxidation-reduction reaction. The scheme represents a redox mass-balance interpretation of the overall stoichiometric relationships, not a mechanism applied to individual GSH or protein molecules; the complete picture of metabolic connectivity in the cell is certainly more complex. Note that this scheme refers to the relative oxidation states of carbon of the proteins, not the oxidation of protein thiol groups to form disulfides. Disulfide bond formation takes place during folding and secretion of proteins, and may also contribute to glutathione metabolism [4]. Although there is growing detail of the pathways of glutathione metabolism in the cell, including compartmentalization between ER and cytoplasm, it is not known how they connect with non-thiol systems (e.g. [40]). Integrating the oxidation-reduction requirements of protein metabolism into existing metabolic models may help to complete the balance sheet of redox interactions in the cell.

Experiments on the connections between redox conditions and protein metabolism at the subcellular level can help elucidate the possible effects of coupling of protein metabolism to the glutathione redox system. If the net transfer of high-Zc proteins to the cytoplasm is stopped, then a decrease in GSH/GSSG redox potential in the ER relative to the cytoplasm would be expected based on the scheme shown in figure 4. This prediction is consistent with the outcome of experiments showing that puromycin-induced halting of protein synthesis causes a decrease in the redox potential monitored by roGFP in the ER [43]. A further untested implication of this hypothesis is that in ER-stress experiments the Zc of the protein population in the ER would increase. Metabolism of proteins might also interact with redox pathways other than the oxidation and reduction of glutathione. A linkage of this type has been documented in plants, where degradation of aromatic and branched-chain amino acids was identified as a source of electrons for the mitochondrial electron transport chain [44].
Figure 5: Average oxidation state of carbon in proteins compared with standard reduction potentials for
ferredoxin (FER1), ferredoxin/thioredoxin reductase (FTRV:FTRC1 dimer), thioredoxin f (TRXF) and m
(TRXM) from spinach [47] and thioredoxin (Trx), glutaredoxin 1 (Grx1) and 3 (Grx3), protein disulfide
isomerase (PDI), and thiol:disulfide interchange protein A (DsbA) and C (DsbC) from E. coli [48].

3.4 Comparison of Zc with standard reduction potentials of proteins

In plants, the oxidation and reduction of the iron-sulfur cluster in ferredoxin and the thiol/disulfide groups in ferredoxin-thioredoxin reductase and thioredoxin are coupled to form the ferredoxin/thioredoxin system [45] ("system" here refers to the set of interacting proteins, and is not a system in the thermodynamic sense). Ferredoxin has the lowest standard reduction potential (midpoint potential, E° at 25 °C) in this system, and its iron-sulfur cluster is reduced by light energy through photosystem I. The oxidation of ferredoxin coupled to the reduction of thioredoxin is catalysed by ferredoxin/thioredoxin reductase. Reduced thioredoxin can disseminate the redox signal via reduction of disulfide groups in other proteins, activating their enzymatic functions. Glutaredoxins are another group of thiol/disulfide proteins that also interact with disulfide bonds in proteins and, unlike thioredoxin, are reduced by glutathione [46].

The standard reduction potentials of proteins in the ferredoxin/thioredoxin system in spinach [47] and of thioredoxin and the glutaredoxin system in E. coli [48] are compared with Zc in figure 5. In the ferredoxin-thioredoxin chain of spinach, there is a strong decrease in Zc with increasing E°. In the glutaredoxin system of E. coli and the associated proteins protein disulfide isomerase (PDI) and thiol:disulfide interchange protein (DsbA), there is a smaller decrease in Zc with increasing E°. Thioredoxin in E. coli, which has Zc and E° that are similar to thioredoxin in spinach, does not follow the trend apparent in the glutaredoxin and related proteins.

264 From figure 5 it appears that the Zc of proteins involved in some parts of the redox signalling networks 

265 is inversely correlated with their standard reduction potential. Do systematic changes in amino acid 

266 composition implied by Zc impact the chemistry of the active site? The atomic environment surrounding 

267 the redox-active sites affects reduction potential [49], and changing very few residues in the vicinity of 
268 the active site can affect the function [50]. However, these or similar proximal effects may not provide 

269 a complete explanation for the trends in Zc in figure 5, which apply to the entire protein sequences. A 

270 speculative explanation is offered here. In general, any oxidation of a protein molecule may involve loss 

271 of an electron from the active site or from a covalent bond distal to the active site. In proteins with higher 

272 Zc, the degree of oxidation of amino acids is greater, making the further loss of electrons from covalent 

273 bonds more difficult. The covalent oxidation of proteins ultimately leads to their degradation, disrupting 

274 the function of redox signalling networks. Therefore, it may be advantageous for low-E° active sites 

275 (those that have a greater potential to lose electrons on signalling timescales), to be associated with high- 

276 Zc proteins (those in which the covalent bonds have lost a greater number of electrons on evolutionary 

277 timescales). The break in the pattern by thioredoxin in E. coli, and the scattered distribution of Zc and 

278 E° of many other redox-active proteins that are not shown, presumably indicate that these hypothetical 

279 relations are applicable only to closely interacting chains of redox-active proteins and not the entire cell. 

280 There are dual hypotheses here: first, that high-Zc proteins have a lower tendency for irreversible oxida- 

281 tive degradation, and second, that reversible reduction potential and tendency for irreversible oxidation, 

282 which are different biochemical properties of the same molecule, are jointly tuned by evolution. One 

283 implication of the first hypothesis is that thioredoxin in spinach, which has a relatively low Zc, would be 

284 more easily covalently oxidized, and therefore may have a higher turnover rate than ferredoxin. 

285 3.5 Phylogenetic variation in Zc of proteins and comparison with optimal growth tem- 

286 perature 

287 A comparison of Zc of the combined proteins from selected microbial genomes is shown in figure 6. The 

288 sets of proteins shown on the left-hand side of figure 6 correspond to those organisms whose scientific 

289 names contain the indicated substring. In many cases, the names of the organisms reflect their envi- 

290 ronments and/or metabolic strategies. Examples of the matching genus names are Natronobacterium, 

291 Haloferax, Rhodobacter, Acidovorax, Methylobacterium, Chlorobium, Nitrosomonas, Desulfovibrio, 

292 Geobacter, Methanococcus, Thermococcus, Pyrobaculum, Sulfolobus. Most terms, however, match 

293 more than one genus (e.g. Pyrobaculum and Pyrococcus). On the right-hand side of figure 6 are shown 

294 genera containing many groups with clinical and technological relevance; by the numbers of points it is 

295 apparent that their representation in RefSeq is greater than that of the environmental microbes. 
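
The substring-based grouping described above can be sketched in a few lines of Python. This is only an illustration, not the code used for figure 6 (which is in plot.R); the function name `group_by_substring` is hypothetical, and case-insensitive matching is an assumption that the text does not confirm.

```python
from collections import defaultdict

def group_by_substring(names, terms):
    """Map each search term to the organism names that contain it.

    Matching is case-insensitive here; that is an assumption made for
    this sketch, not something stated in the paper.
    """
    groups = defaultdict(list)
    for name in names:
        lowered = name.lower()
        for term in terms:
            if term.lower() in lowered:
                groups[term].append(name)
    return dict(groups)

# Genus names taken from the examples listed in the text.
names = ["Natronobacterium", "Haloferax", "Thermococcus",
         "Pyrobaculum", "Pyrococcus"]
groups = group_by_substring(names, ["Natr", "Halo", "Thermo", "Pyro"])
for term, members in groups.items():
    print(term, members)
```

As in the text, a single term such as "Pyro" collects more than one genus, so each group on the left-hand side of figure 6 can mix phylogenetically distinct organisms.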

296 A general trend toward lower Zc in proteins in organisms from hot environments (e.g. represented by 

297 Thermo and Pyro in the names) is apparent. Organisms with the highest Zc inhabit saline evaporative wa- 

298 ters (Natr, Halo), while other aquatic organisms (e.g. Rhodo) have less highly oxidized proteins. Within 

299 a given genus (right-hand side of plot), the clusters of Zc tend to be tighter, reflecting conserved compo- 

300 sitional trends. Streptomyces , common in soils, has the highest Zc of the genera shown here. Buchnera 

301 is notable because its proteins are highly reduced, including one example (Buchnera aphidicola BCc) 

302 with the lowest Zc of proteins within the entire dataset. B. aphidicola BCc is the primary endosymbiont 

303 of the cedar aphid (Cinara cedri) and, at the time of sequencing, had one of the smallest known bacterial 

304 genomes [51]. Being relatively closely related [52], Mycoplasma and Clostridium also have relatively 

305 low Zc. Mycoplasma are known for their small genomes and dependence on metabolic products of the 

306 host; the low Zc may be a constraint imposed by growth in reducing intracellular or intraorganismal 

307 environments. 
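
For readers who want to reproduce comparisons of this kind, the following Python sketch computes Zc from a one-letter amino acid sequence. It assumes the usual elemental-ratio expression for a neutral molecule with formula CcHhNnOoSs, Zc = (-h + 3n + 2o + 2s) / c, and builds the protein formula by summing amino acid formulas and removing one H2O per peptide bond. The names `AA_FORMULA` and `zc_from_sequence` are hypothetical and are not part of the CHNOSZ package used for the figures.

```python
# Elemental formulas (C, H, N, O, S) of the 20 neutral amino acids.
AA_FORMULA = {
    "A": (3, 7, 1, 2, 0),  "R": (6, 14, 4, 2, 0), "N": (4, 8, 2, 3, 0),
    "D": (4, 7, 1, 4, 0),  "C": (3, 7, 1, 2, 1),  "Q": (5, 10, 2, 3, 0),
    "E": (5, 9, 1, 4, 0),  "G": (2, 5, 1, 2, 0),  "H": (6, 9, 3, 2, 0),
    "I": (6, 13, 1, 2, 0), "L": (6, 13, 1, 2, 0), "K": (6, 14, 2, 2, 0),
    "M": (5, 11, 1, 2, 1), "F": (9, 11, 1, 2, 0), "P": (5, 9, 1, 2, 0),
    "S": (3, 7, 1, 3, 0),  "T": (4, 9, 1, 3, 0),  "W": (11, 12, 2, 2, 0),
    "Y": (9, 11, 1, 3, 0), "V": (5, 11, 1, 2, 0),
}

def zc_from_sequence(seq):
    """Average oxidation state of carbon for a neutral polypeptide."""
    if not seq:
        raise ValueError("sequence must not be empty")
    c = h = n = o = s = 0
    for aa in seq.upper():
        dc, dh, dn, do, ds = AA_FORMULA[aa]
        c, h, n, o, s = c + dc, h + dh, n + dn, o + do, s + ds
    # One water molecule is released for each peptide bond formed.
    waters = len(seq) - 1
    h -= 2 * waters
    o -= waters
    return (-h + 3 * n + 2 * o + 2 * s) / c

print(zc_from_sequence("G"))   # glycine
print(zc_from_sequence("L"))   # leucine
print(zc_from_sequence("AG"))  # a dipeptide
```

Removing water does not change the numerator (-h + 2o is unaffected), so Zc of a protein depends only on its amino acid composition, which is why averaged compositions of whole genomes can be compared directly.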

308 We now turn to a comparison of homologues of a specific protein. Rubisco is an essential enzyme 

309 for carbon fixation. Sequence comparison of homologues (related sequences that appear in different 

310 organisms) has provided the basis for many phylogenetic studies [53]. Major divergent forms of the 

311 enzyme are Forms I and II, found in aerobic organisms, and Form III and "Rubisco-like proteins", found 

312 in anaerobic organisms [54]. The organisms listed in table 3 have in common the occurrence of Rubisco 

313 in their genome. Forms I, II or III were included in this comparison, but Rubisco-like proteins were 
Figure 6: Average oxidation state of carbon in total combined proteins from sequenced microbial 
genomes. Sequences of microbial proteins were taken from NCBI RefSeq release 61. Only organisms 
with total sequenced protein length greater than 100,000 amino acids were used, leaving 6323 organisms. 
The group on the left-hand side is identified by substring matches in the scientific name of the organism; 
the terms were chosen to emphasize environmental variation. The group on the right-hand side consists 
of the indicated genera, emphasizing organisms of clinical and biotechnological relevance. The final cat- 
egory represents total proteins from all microbial genomes meeting the minimal size requirement, many 
of which are not shown in the other categories. 
Table 3: Names of species, optimal growth temperatures (Topt) and UniProt [26] accession numbers (IDs) 
for the large subunit of ribulose bisphosphate carboxylase (Rubisco). Literature references for Topt are 
indicated in brackets. Abbreviations: A - Archaea; B - Bacteria; E - Eukaryota. The numbers are used 
to identify the points in figure 7 (duplicated numbers occur in different temperature ranges). 



number  domain  species                                       Topt (°C)  reference  ID
1       E       Phaeocystis antarctica                        0-6        [55]       Cj9rlD8
2       B       Octadecabacter antarcticus 307                4-10       [56]       M9R7V1
3       A       Methanolobus psychrophilus                    18         [57]       K4MAK9
4       A       Methanococcoides burtonii                     23         [58]       Q12TQ0
5       B       Bradyrhizobium japonicum                      25-30      [59]       Q9ZI34
6       B       Thiobacillus ferrooxidans                     28-30      [60]       P0C916
7       E       Zea mays                                      30         [61]       P00874
8       B       Mariprofundus ferrooxydans                    30         [62]       Q0EX22
9       B       Desulfovibrio hydrothermalis                  35         [63]       L0RHZ1
1       A       Methanosarcina acetivorans                    35-40      [64]       Q8THG2
2       B       Acidithiobacillus caldus                      45         [65]       F9ZLP0
3       E       Cyanidium caldarium                           45         [66]       P37393
4       B       Sulfobacillus acidophilus                     45-50      [67]       P72383
5       B       Pseudomonas hydrogenothermophila              52         [68]       Q51856
6       B       Synechococcus sp. (strain JA-2-3B'a(2-13))    50-55      [69]       Q2JIP3
7       A       Methanosaeta thermophila                      55-60      [70]       A0B9K9
8       B       Thermosynechococcus elongatus                 57         [71]       Q8DIS5
9       B       Clostridium clariflavum                       60         [72]       G8LZL2
1       B       Bacillus acidocaldarius                       65         [73]       F8IID7
2       B       Thermotoga lettingae                          65         [74]       A8F7V4
3       B       Thermomicrobium roseum                        70         [75]       B9KXE5
4       A       Archaeoglobus fulgidus                        76         [76]       O28635
5       A       Methanocaldococcus jannaschii                 85         [77]       Q58632
6       A       Thermofilum pendens                           85-90      [78]       A1RZJ5
7       A       Staphylothermus marinus                       85-92      [79]       A3DND9
8       A       Pyrococcus horikoshii                         98         [80]       O58677
9       A       Pyrococcus furiosus                           100        [81]       Q8U1P9



314 excluded; however, some that were tested were found to have considerably lower Zc. The selection of 

315 organisms was made in order to represent a variety of optimal growth temperatures (Topt) as reported in 

316 the studies cited in the table [55-81]. 

317 A comparison between Topt and Zc of Rubisco is presented in figure 7. The Zc of Rubisco is somewhat 

318 higher than that of the bulk proteins of the organisms; compare for example the values for Pyrococcus 

319 horikoshii and P. furiosus (the highest-temperature points labelled 8 and 9 in figure 7a) with the range of 

320 values for "Pyro" in figure 6. At lower temperatures (0 to 50 °C), the differences between domains of life 

321 are most apparent; Rubiscos of the Bacteria in this sample set are more oxidized than those of Archaea 

322 and Eukaryota. There is a tendency for the Rubiscos of the Archaea to have lower Zc ; this appears to 

323 be characteristic of anaerobic methanogenesis and Form III Rubiscos. An interesting exception is the 

324 high-Zc Rubisco of Methanosaeta thermophila; this organism grows on acetate to produce both CH4 

325 and CO2 [70]. The major pattern that emerges is that higher temperatures are associated with a lower 

326 average oxidation state of carbon in proteins. As outlined below, a decrease in oxidation state of carbon 

327 in the covalent structure of the proteins confers energetic savings in hot, reducing environments. 
Figure 7: (a) Average oxidation state of carbon in Rubisco compared with optimal growth temperature 
(Topt) of organisms. Numbers are used to identify the organisms (see table 3). (b-c) Gibbs energies of 
formation reactions, per residue, of selected Rubisco from organisms in the 23-30 °C range of optimal 
growth temperature (points labelled 4-7). (b) Total Gibbs energies of individual reactions as a function 
of Eh and (c) difference between the reaction for T. ferrooxidans and the others are shown. The grey 
highlight indicates the protein with the lowest ΔG along the range of Eh values. 



328 3.6 Beyond Zc: energetics of protein formation as a function of environmental redox 

329 potential 



330 To a first approximation, energetic considerations predict that more reducing conditions tend to favour 

331 formation of proteins with relatively lower Zc, and vice versa. To assess the directionality and magnitude 

332 of chemical forces on the evolutionary transformations of proteins, the energetics of reactions can be 

333 calculated using thermodynamic models. Because the timescales of evolution are much longer than 

334 transformations of biomolecules during metabolism, a discussion of the assumptions underlying the 

335 application of thermodynamic theory to biochemical evolution is warranted. 
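
The first-order link between Zc and redox conditions can be made concrete by balancing a formation reaction from inorganic species. The sketch below assumes the reaction is written from CO2, NH3, H2S, H2O, H+ and e-, with equal numbers of protons and electrons so that charge is conserved; the function names are hypothetical. Under these assumptions the number of electrons consumed per carbon atom works out to 4 - Zc, so a more reduced (lower-Zc) molecule consumes more electrons and is favoured relatively more as conditions become more reducing.

```python
def formation_coefficients(c, h, n, o, s):
    """Balance  c CO2 + n NH3 + s H2S + w H2O + y H+ + y e-  ->  CcHhNnOoSs.

    A negative coefficient means the species appears on the product side.
    """
    water = o - 2 * c                          # oxygen balance
    electrons = h - 3 * n - 2 * s - 2 * water  # hydrogen balance
    return {"CO2": c, "NH3": n, "H2S": s, "H2O": water,
            "H+": electrons, "e-": electrons}

def electrons_per_carbon(c, h, n, o, s):
    """Electrons consumed per carbon atom; equals 4 - Zc for a neutral formula."""
    return formation_coefficients(c, h, n, o, s)["e-"] / c

# Glycine, C2H5NO2: 2 CO2 + NH3 + 6 H+ + 6 e- -> C2H5NO2 + 2 H2O
glycine = formation_coefficients(2, 5, 1, 2, 0)
print(glycine)
# Leucine, C6H13NO2, is more reduced and needs more electrons per carbon.
print(electrons_per_carbon(2, 5, 1, 2, 0), electrons_per_carbon(6, 13, 1, 2, 0))
```

This is only the stoichiometric part of the argument; the calculations in this section are normalized per residue rather than per carbon, and standard Gibbs energies and the activities of the other species also contribute.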

336 The calculation of equilibrium provides a quantitative description of the state of a system in an energy 

337 minimum. The assumption of equilibrium is the foundation for many models of inorganic processes in 

338 geochemistry. Although the metabolic formation of proteins and other biochemical constituents proceeds 

339 in a non-equilibrium manner, using the equilibrium state as a frame of reference makes it possible to 

340 compare the energetics of living systems quantitatively ("how far from equilibrium"). 

341 In energetic terms, adaptation can be defined as a problem of optimal efficiency, with trade-offs between 

342 energy utilization and power [82, p. 140]. Overall minimization of biomass synthesis costs can be 

343 expected from the energy utilization standpoint. The links between energetics and evolutionary outcome 

344 implicitly depend on the proposal that greater fitness is associated with mutations that lower the synthesis 

345 costs of proteins for a given function (e.g. [83]). 

346 It has been argued previously that the propensity for some evolutionary changes can be modelled using 
347 the related concepts of equilibration, energy minimization, and maximum entropy. In an evolution- 

348 ary context, these concepts have often been defined by analogy to their definitions in thermodynam- 

349 ics [82, 84]. The current discussion instead considers the possibility of direct application of chemical 

350 thermodynamics (the geochemical approach) to formulate a quantitative description of patterns of protein 

351 composition. The major assumption used in the following discussion is that energetic demands of pro- 

352 tein formation depend not only on the composition of the protein, but also the environmental conditions. 

353 Therefore, it is expected that environmental adaptation has left an imprint on chemical compositions of 

354 phylogenetically distinct proteins. 

355 An example calculation is carried out here for selected Rubiscos from organisms with optimal tempera- 

356 tures in the range of 23-30 °C (table 3, numbers 4-7). The basis species (representing inorganic starting 

357 materials) and their chemical activities used for this example are CO2 (10⁻³), H2O (1), NH3 (10⁻⁴), H2S 

358 (10⁻⁷) and H+ (10⁻⁷, i.e. pH = 7). The activity of the electron, representing the effects of the redox 

359 variable (Eh) through the Nernst equation, is left to vary. The Gibbs energies of the reactions to form 

360 the proteins were calculated as described previously [9, 19] and are plotted in figure 7b. In figure 7b it 

361 can be seen that the Gibbs energies of the reactions to form the proteins (ΔG), normalized by protein 

362 length, steadily increase (become less favourable) with increasing Eh, but the differences between the 

363 proteins cannot be discerned easily. In figure 7c, the values of ΔG are shown relative to the reaction 

364 for T. ferrooxidans, giving a difference in Gibbs energy of formation (ΔΔG) that can be used to assess 

365 the relative energetics of the reactions. Lower energies indicate the most stable protein (in the sense of 

366 chemical formation, not structural conformation), and these relative energies depend on the chemical 

367 conditions of the environment, measured in part by the oxidation-reduction potential. 
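
The role of the electron activity in these calculations can be illustrated with the Nernst relation. The sketch below is not the CHNOSZ implementation; it assumes the common convention pe = -log10(a_e) = F·Eh / (ln(10)·R·T) and a reaction that consumes a given number of electrons as reactants, in which case the Eh-dependent part of ΔG is n·F·Eh. The function names are hypothetical.

```python
import math

F = 96485.33212   # Faraday constant, C/mol
R = 8.314462618   # gas constant, J/(mol K)

def eh_to_pe(eh_volts, temp_k=298.15):
    """Convert Eh (volts) to pe, the negative log10 of the electron activity."""
    return F * eh_volts / (math.log(10) * R * temp_k)

def log_activity_electron(eh_volts, temp_k=298.15):
    """log10 of the electron activity at a given Eh."""
    return -eh_to_pe(eh_volts, temp_k)

def eh_term_of_delta_g(n_electrons, eh_volts):
    """Eh-dependent contribution to Delta G (J/mol) when n electrons are reactants."""
    return n_electrons * F * eh_volts

# At 25 degrees C, about 59 mV corresponds to one pe unit.
print(eh_to_pe(0.05916))
# The contribution grows linearly with Eh, and faster for more electrons.
print(eh_term_of_delta_g(6, 0.1), eh_term_of_delta_g(12, 0.1))
```

Because this term is linear in Eh, ΔG for each protein rises with Eh when the other activities are held fixed, and the slope is set by the number of electrons in the formation reaction.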

368 By using Gibbs energy calculations such as those shown in figure 7c, one can make an assessment of 

369 which protein in a given redox environment demands lower energy for overall synthesis (i.e. is more 

370 stable) than other possibilities. Where two lines cross in figure 7c, the energies to form the two proteins 

371 are equal, representing a metastable (partial) equilibrium. In metastable equilibrium, the coexistence of 

372 proteins with equal energies of formation corresponds to a local energy minimum. This interpretation 

373 does not preclude the non-equilibrium character of the overall biosynthetic process, because the energies 

374 of formation remain non-zero (figure 7b). 
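
The idea of a crossing point between two formation energies can be sketched numerically. The example below treats each per-residue ΔG as a straight line in Eh (intercept plus slope times Eh), which holds only if the activities of the other basis species and the temperature are fixed. The intercepts and slopes are hypothetical numbers chosen for illustration; they are not values from figure 7, and the protein labels are placeholders.

```python
def crossing_eh(line1, line2):
    """Eh at which two linear Delta G(Eh) relations are equal, or None if parallel.

    Each line is an (intercept, slope) pair.
    """
    (a1, b1), (a2, b2) = line1, line2
    if b1 == b2:
        return None
    return (a2 - a1) / (b1 - b2)

def most_stable(lines, eh):
    """Name of the protein with the lowest Delta G at the given Eh."""
    return min(lines, key=lambda name: lines[name][0] + lines[name][1] * eh)

# Hypothetical per-residue values: intercept in kJ/mol, slope in kJ/(mol V).
lines = {
    "protein_A": (-10.0, 100.0),  # steeper slope: more electrons consumed
    "protein_B": (-5.0, 90.0),
}
eh_cross = crossing_eh(lines["protein_A"], lines["protein_B"])
print(eh_cross)                   # Eh of equal formation energies
print(most_stable(lines, 0.0))    # favoured below the crossing
print(most_stable(lines, 1.0))    # favoured above the crossing
```

In this toy case the protein with the steeper slope, which consumes more electrons per residue, is the less costly one below the crossing, and the other protein takes over at higher Eh.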

375 The three least costly proteins, going from low to high Eh, are those from M. burtonii, T. ferrooxidans, 

376 and Z. mays. Methanogenic archaea such as M. burtonii [58] inhabit reducing environments, where Eh 

377 values at least as low as -400 mV have been documented [2]. In contrast, the metal-leaching activity of 

378 T. ferrooxidans is associated with an increase in oxidation-reduction potential (ORP, which carries the 

379 same units as Eh) [85] and, compared to M. burtonii, its Rubisco is more oxidized and is consequently 

380 more stable at higher Eh. The correspondence between redox conditions and lower ΔG of formation of 

381 the proteins provides thermodynamic evidence for environmental adaptation of the proteins. 

382 The protein from T. ferrooxidans, which has the highest Zc of the four considered, does not have the 

383 lowest value of ΔΔG in the most oxidizing (highest Eh) conditions. Instead, the Rubisco from Z. mays, 

384 even though it has lower Zc, is calculated to be relatively more stable at the highest Eh values considered. 

385 This result shows that comparing values of Zc provides only an approximation of the dependence of the 

386 relative energetics of protein formation on redox changes; chemical thermodynamic models integrate 

387 more information about reaction stoichiometry and the effects of multiple environmental variables. 

388 Previously, comparison of Rubisco from hot-spring organisms adapted to higher temperatures revealed an 

389 increase in the frequency of hydrophobic amino acids, interpreted as increasing the conformational stability of 

390 proteins at high temperature [86]. However, another basis for interpretations of thermophilic adaptation 

391 depends on the relative energetic costs of synthesis of amino acids [87]. The finding made in this study 

392 is that the high-temperature Rubiscos exhibit a shift toward lower Zc (figure 7a). Higher temperatures 

393 are often associated with more reducing environments. For example, compared to seawater, activities 
394 of dissolved hydrogen in hydrothermal fluids are higher, and mixed hydrothermal-seawater fluids have 

395 a reducing potential that favours formation of relatively reduced amino acids [8]. Redox potential is a 

396 major variable affecting the energetics of protein formation at different temperatures; therefore, adapta- 

397 tion to minimize biosynthetic costs in high-temperature environments is likely to have more than just a 

398 thermophilic (temperature dependent) aspect. 

399 Although the stoichiometric comparisons are distinct from sequence-based phylogenetic analyses which 

400 are used to test for positive selection, it is conceivable that Zc could in the future be incorporated into 

401 tree-based models of evolution that take account of physicochemical properties of amino acids in pro- 

402 teins [88]. However, as noted above, thermodynamic comparisons have greater power than compositional 

403 comparisons such as Zc. The computation of metastable equilibrium takes account simultaneously of 

404 temperature, redox potential (expressed as Eh, activity of hydrogen, or oxygen fugacity) and other vari- 

405 ables [19]. In another study, analysis of metagenomic and geochemical data led to a predicted metastable 

406 succession of proteins (with generally increasing Zc) that could be aligned with a gradient of increasing 

407 oxidation potential and decreasing temperature in a flowing hot spring [10]. When grouped by taxonomic 

408 similarity, the Zc in the hot-spring proteins, while becoming lower at high temperature, also spread over 

409 a broader range, leading to tighter constraints on the redox conditions suitable for metastable coexistence 

410 of the organisms [11]. 

411 Not only differences in the oxidation states of present-day environments, but also the oxygenation of 

412 Earth's atmosphere and oceans through geological time could have profound impacts on the energetics 

413 of biomass synthesis [7]. Adaptation to reduce these costs likely would lead to divergences in Zc that 

414 are apparent across different taxa, while closer phylogenetic relationships should confer a similarity in 

415 Zc. In common with the Rubiscos, comparison of the total proteins of microbes reveals a tendency for 

416 proteins in organisms associated with hot, as well as sulfidic and methanogenic environments, to be more 

417 reduced (figure 6). 



418 4 Conclusions 



419 Proteins are products of metabolism; their synthesis and degradation are part of the network of chemical 

420 reactions that sustains the living cell. Comparisons of the average oxidation state of carbon in pro- 

421 teins have provided a starting point for visualizing the compositional diversity of proteins in relation 

422 to redox chemistry in subcellular compartments and external environments. The large differences in 

423 Zc of proteins in locations such as the ER and cytoplasm likely have consequences for the dynamics of 

424 oxidation-reduction reactions involving glutathione and other metabolites. Further insight may be gained 

425 by including the formation and degradation of proteins in kinetic and stoichiometric models of metabolic 

426 networks. Extension of these concepts to other phenomena entailing changes in both redox potential and 

427 protein expression, such as stress response to oxidizing agents and the cell cycle, can be envisioned. The 

428 deepest significance of the observed patterns lies in their emergence over evolutionary timescales. The 

429 inverse trend relating Zc with standard reduction potentials in chains of redox-active proteins is a case 

430 where the chemical composition of the proteins may be tuned with the electron-transfer chemistry of the 

431 active sites. Compositional divergences among proteins are also apparent in phylogenetic comparisons, 

432 and here it is reasonable to conclude that correlations between oxidation state of carbon in proteins and 

433 the redox potential of the environment indicate some degree of energy savings conferred by evolution. 

434 The natural history of protein evolution is a result of processes that are both unpredictable (mutation 

435 events) and, to some extent, deterministic (selection for fitness in a given environment). By describing 

436 protein molecules in terms of chemical composition and energetics it will be possible to identify some 

437 of the forces that help to shape the occurrences of proteins in different cells and environments. 
438 5 Supporting information 

439 The supporting information is provided in a ZIP archive containing the following code and data files: 

440 prep.R This file contains code used to prepare the data files for easier handling by the plotting functions. 

441 plot.R This file contains code used to make the figures appearing in this paper. The functions, in the 

442 order of the figures (1-7), are amino(), human(), yeast(), potential(), midpoint(), phylo(), and 

443 rubisco(). The code is written in R [18] and depends on version 1.0.2 of the CHNOSZ package 

444 [19], available from the Comprehensive R Archive Network (http://cran.r-project.org). 

445 data/SGD_associations.csv For yeast genes, this table lists the accessions, SGDID, and the associ- 

446 ation to cellular components in the Gene Ontology, derived from gene_association.sgd.gz, pro- 

447 tein_properties.tab and go_terms.tab downloaded from http://www.yeastgenome.org on 2013- 

448 08-24. All gene associations with the NOT qualifier were removed, as were those without a match- 

449 ing entry in protein_properties.tab (e.g. RNA-coding genes). 

450 data/ZC_HUMAN.csv, ZC_membrane.csv Compilations of the values of Zc for human proteins and 

451 human membrane proteins. Values in ZC_HUMAN.csv were calculated from protein sequences 

452 in HUMAN.fasta.gz, downloaded from ftp://ftp.uniprot.org/pub/databases/uniprot/ 

453 current_release/knowledgebase/proteomes/HUMAN.fasta.gz on 2013-08-24 (file dated 

454 2013-07-24). Values in ZC_membrane.csv were calculated from protein sequences in all *.fa 

455 files in Additional File 2 of Almen et al., 2009 [27]. 

456 data/codons.csv In the first column, the three-letter abbreviations for each of the RNA codons; in the 

457 second column, the names of the corresponding amino acids. 

458 data/midpoint.csv List of protein names, UniProt IDs and standard midpoint reduction potentials used 

459 to make figure 5. Start and stop positions, taken from UniProt, identify the protein chain excluding 

460 initiator methionines or signal peptides. 

461 data/protein_refseq.csv Amino acid compositions of total proteins in 6758 microbial genomes from 

462 RefSeq release 61, dated 2013-09-09. The gene identifier (gi) numbers of the sequences were 

463 assigned taxonomic IDs (taxids) using the RefSeq release catalogue. The amino acid compositions 

464 of the total proteins were calculated by averaging the compositions of all proteins for each taxid. 

465 The "organism" column contains the taxid used in NCBI databases, the "ref" column contains the 

466 names of the RefSeq files from which the amino acid sequences were taken (with start and end 

467 positions in parentheses) followed by the scientific name of the organisms in brackets, and the 

468 "abbrv" column contains the number of amino acids for that organism. Scientific names for the 

469 taxids at the species level were found using the names.dmp and nodes.dmp files downloaded from 

470 ftp://ftp.ncbi.nih.gov/pub/taxonomy/taxdump.tar.gz on 2013-09-18. 

471 data/rubisco.csv UniProt IDs for Rubisco and optimal growth temperatures of organisms (see table 3). 

472 cell/*.png PNG images for each of the cellular components used to make figure 3. 

473 fasta/midpoint/*.fasta FASTA sequence files for proteins shown in figure 5. 



474 fasta/rubisco/*.fasta FASTA sequence files for each Rubisco identified in table 3. 